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Distributions 
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Abstract 

An expanded family of mixtures of multivariate power exponential distributions 
is introduced. While fitting heavy-tails and skewness has received much attention in 
the model-based clustering literature recently, we investigate the use of a distribution 
that can deal with both varying tail-weight and peakedness of data. A family of 
parsimonious models is proposed using an eigen-decomposition of the scale matrix. 
A generalized expectation-maximization algorithm is presented that combines convex 
optimization via a minorization-maximization approach and optimization based on 
accelerated line search algorithms on the Stiefel manifold. Lastly, the utility of this 
family of models is illustrated using both toy and benchmark data. 


1 Introduction 


Mixture mo dels have become t he most popular methodology to i nvestigate heterogeneity 
in data (cf. iTitterington et al.l. Il985l: iMcLachlan and Peell. l2000bl) . Model-based learning 
makes use of mixture models to partition data points. Model-based clustering and classi¬ 
fication refer to the scenarios where observations have no known labels and some known 
labels, respectively, a priori. The number of these partitions or clusters may or may not 
be known in advance . While approaches based on mix tures of Gaussian distributions (e.g., 
Banfield and Rafterv . 1993 : Celeux and Gova^. 1995 1 remain popular for model-based clus¬ 


tering, these algorithms are susceptible to performing poorly in the presence of outliers. As 
a result, more robust mixtures of distributions are becoming increasingly popular. Some 
of these mixtures aim t o tackle tail-weight (e.g., Andrews and McNichoM. 2011 . 20121: 


Forbes and Wraithl. l2014l) . some deal wi th skewness (e.g.. iLin et al. 


2014 ) , while others account for both fe.g..lKarlis and Santourianl. 


2009 


2014 : Vrbik and McNicholas . 2014 : Browne and McNicholas . 20151) . 
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Herein, we utilize a f amily of mixture mo dels based on the multivariate power exponen¬ 
tial (MPE) distribution (IGomez et all Il998l) . This distribution is sometimes also called the 
multivariate generalized Gaussian distribution. Depending on the shape parameter f3, two 
kinds of distributions can be obtained: for 0 < /5 < 1 a leptokurtic distribution is obtained, 
which is characterized by a thinner peak and heavy tails compared to the Gaussian distribu¬ 
tion; whereas, for /3 > 1, a platykurtic distribution is obtained, which is characterized by a 
flatter peak and thin tails compared to the Gaussian distribution. The distribution is quite 
flexible: for f3 = 0.5, we haye a Laplace (double-exponential) distribution and, for = 1, we 
have a Gaussian distribution. Furthermore, when f3 ^ oo the MPE becomes a multivariate 
uniform distribution. 

The MPE distr i bution has been u s ed in many different applications flLindseyl . Il999 : 


Gho and Bui . 2005 : Verdoolaege et ah . 2008h . However, due to difficulties in estimating 
the covariance over the entire support of the shape parameter jS G (0,oo), its potential 
has not yet been fully explored. This distribution presents a difficult parameter estima¬ 
tion problem because none of the parameter estimates are available in closed form. Previ¬ 
ously proposed estimation strategies ha ye included optimiz ation based on geodesic convexity 
for unconstra- i ned c ovariance matrices flZhang et ahl 120131) and Newton-Raphson recursions 
( Pascal et ah 20131). Some work with this distribution has focused on the special case wher e 
0 < /3 < 1 f Gomez-Sanchez-Manzano et ah . 2008 : Rombrnn et af.l. 2012 : Pascal et al.l. 2013). 
However, for imposing parsimony in a traditional model-based clustering context (through 
different constraints on terms of specific decompositions of the component scale, or covari¬ 
ance, matrices), these methods are not ideal. Previously, a famil y of five models based o n 
mixtures of MPE distributions has been used for robust clustering f Zhang and Lian^. 2010 ). 
This work made use of fixed point iterations for the special case where 0 < /? < 2 (see 
Appendix 1^. Within 0 < < 2, the hxed point algorithm converges; however, it yields 

monotonic improvements in log-likelihood only for 0 < /5 < 1. For /5 > 2, this fixed point 
algorithm is guaranteed to diverge, which leads to (negativ e) inhnite log-likelihood values. 


Herein, a generalized expectation-maximization (GEM; iDempster et ahl. 119771) strategy 


is proposed and illustrated. This algorithm works for 0 < /9 < oo. This estimation pro- 


(Hunter and Lange 

2000) and accelerated line search algorithms on the Stiefel manifold 

(Absil et ah. 

2009; 

Browne and McNicholas. 2014b). This allows for the estimation of a 


wide range of constrained models, and a family of sixteen MPE mixture models is presented. 
These models can account for varying tail weight and peakedness of mixture components. 
In Section [21 we summarize the MPE distribution. Section [3] gives a GEM algorithm for pa¬ 
rameter estimation. Section 0] investigates the performance of the family of mixture models 
on toy and benchmark data. We conclude with a discussion and suggest some ayenues for 
further research in Section O 
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Figure 1: Density plots for different values of 13. The MPE distribution is quite flexible: for 
(3 = 0.5, we have a Laplace (double-exponential) distribution and for (3 = 1, we have a Gaus¬ 
sian distribution. Furthermore, as /3 ^ oo, the MPE distribution becomes a multivariate 
uniform distribution. 
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2 Multivariate Power Exponential Distribution 


A ran dom vector X follows a p-dimensional power exponential distribution flLandsman and Valdezl . 
20031 ) if the density is of the form 


S,r,s) = Cp|S| 2 exp <{-—(5(a3) 


( 1 ) 


where 


Cp 


*r(i) 


rP/( 2 s) 


{2^)r/V (i) 

S{x) := 6 S) = (a; — ^)' (a; — /x), and S are the location parameter (also the 

mean) and positive-dehnite scale matrix, respectively, and r,s > 0. This elliptical distribu¬ 
tion is a multivariate Kotz-type distribution. However, it has identihability issues concerning 
S and r: the density with 0 = {/x, S*, r*, s}, where S* = S/2 and r* = r/2®, is the same 
as ([I|). 

Using the parametrization given by Gomez et al.l f 19981) . a random vector X follows a 
p-dimensional power exponential distribution if the density is 


/(ic|^,S,/3) =/c|S| texp <{ 


( 2 ) 


where 


k = 


pr(f) 


ttpAT (1 + ^ j 2^+^ 

5{x) := 5 {x\fj,, S) = (a; — /x)^S”^ (x — /x), is the location parameter (also the mean), S 
is a positive-dehnite scale matrix, and (3 determines the kurtosis. Moreover, it is a special 
parameterization of the MPE distribution given in ([1]), with r = 2^“^ and s = (3. The 
covariance and multidimensional kurtosis coefficient for this distribution are 


Cov(X) = 


2V/5r 




( 3 ) 


and 


72 (X) = 




2/3 


p2 f P+^ \ 
I 2/3/ 


V^-p(p + 2 ), 


( 4 ) 


respectively flGomez et ahl. 119981) . Here, 'j 2 {X) denotes the multidimensional kurtosis coef- 
hcient that is dehned as 

E I [(X - ^)'Var(X)-i(X - /^)]'} - p{p + 2) 
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(Mardia et ah. 

1980; 

Gomez et ah. 

1998). For 0 G (0,1), the MPE distribution is a scale 

mixture of Gaussian distributions ( 

Gomez-Sanchez-Manzano et ah. 20081). 


Based on the MPE distribution, a mixture model can conveniently be defined as 


G 

g{x\&) = TTgf {x\fj,g, Eg, /3g) , 
9=1 


where /(•) is the g^th component density and © denotes all parameters. Here, ^ig, Eg, and 
[3g denote the mean, scale matrix, and shape parameter, respectively, of the gi\i component. 
Here, tti, ..., ttg are the mixing weights such that TTg > 0 {g = 1,..., G) and J2g=i '^g — 
Note that mixtures of M PE distributions have previously been shown to be identifiable 
( Zhang and Liang . 2010h . 

Because the number of parameters in the scale matrix increases quadratically with data 
dimensionality, it is common practice to impose a decomposition that allows for reduction in 
the number of parameters to be estimated. An eigen-decomposition decomposes a component 


covariance matrix into the form 5E = A^gAgrg', where Ag, 


Pg, and Ag can be interpreted 


geometrically ( Banfield and Rafterv . 19931) . Specifically, Ag is a diagonal matrix with entries 
proportional to the eigenvalues of Eg (with |Ag| = 1), 


Ag is the associated constant of 


proportionality, and Pg is a p x p orthogonal matrix of the eigenvectors of Eg (with entries 
ordered according to the eigenvalues). Constraining these terms to be e qual or variable across 


group s allows for a family of fourteen parsimonious mixture models (ICelenx and Govaert 


19951 ). In this paper, we work with a subset of eight parsimonious models (EH, VH, EEI, 


VVI, EEE, EEV, VVE, and VVV), including the most parsimonious (EH) and the fully 
unconstrained (VVV) models (Table [T]). In addition, there is the option to constrain (3g 
to be equal across groups. This option, together with the covariances structures, results 
in a family of sixteen models. The nomenclature for this family is a natural extension of 
that used for the covariance structures, e.g., the model with a VVI scale structure and (3g 
constrained to be equal across groups is denoted VVIE. This family of models is referred to 
as the ePEM (eigen-decomposed power exponential mixture) family hereafter. 


3 Inference 


The expectation-maximization (EM) algorithm (jPempster et al.l. 119771) is an iterative pro¬ 


cedure based on the complete-data likelihood. At each iteration, the the expected value of 
the complete-data log-likelihood is maximized to yield updates for the parameters o f inter- 
est. The expectation-conditional-maximization (ECM) algorithm flMeng and R.nbinl. 119931) 
replaces the maximization step of the EM algorithm with a number of conditional maxi¬ 
mization (CM) steps. This might be necessary due to the form of the likelihood or because 
the conditional maximization steps are less computationally expensive. In our parameter 
estimation algorithm, CM steps are used within a framework that increases, rather than 
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Table 1: Nomenclature, scale matrix structure, and the number of free scale parameters for 
the ePEM family of models. 


Model 

A. 

A, 

r. 

s. 

Free Gov. Parameters 

Eli 

Equal 

Spherical 

- 

XI 

1 

VII 

Variable 

Spherical 

- 

X,I 

G 

EEI 

Equal 

Equal 

Axis-Aligned 

AA 

P 

VVI 

Variable 

Variable 

Axis-Aligned 


Gp 

EEE 

Equal 

Equal 

Equal 

AT AT' 

p{p + l) /2 

EEV 

Equal 

Equal 

Variable 

Ar,Ar; 

Gp{p+l)/2 - (G - l)p 

VVE 

Variable 

Variable 

Equal 

a.ta.t' 

p{p + l)/2 -h (G - l)p 

VVV 

Variable 

Variable 

Variable 


Gp {p + 1) /2 


maximizes, the expected value of the complete data log-likelihood at each iteration. Such 
an approach, i.e., one that has the latter feature, is called a GEM algorithm. The parameter 
updates associated with our GEM algorithm are given in Appendix [Bl 


4 Results 


For our numerical analyses, we use the Bayesi an information criterio n (BIG; ISchwara . Il978l ) 
and the integrated complete likelihood (IGL; Biernacki et ah . 2nnnh for model selection. A 


stopping criterion based on the Aitken accelera tion dAitken . 


19261 is used to determine 


convergence and the adjusted Rand index (ARI; [Hubert and Arabiel. Il985h is used for per¬ 
formance assessment. More details are in Appendix O In Appendix |Al we compare the 
performance of our algorithm to an algorithm based on fixed point iterations. 


4.1 Simulations 

For simulating from the MPE distribution, a mo dified version of the f unctio n rmvpowerexp 
from package MNM (INordhansen and Ojal . 129111 1 in R (IR Gore Teaml . 129131 1 is used. The 
function was modihed due to a typo in the rmvpow erexp code. This p rogram utilizes the 


stochastic representation of the MPE distribution flGomez et ahl. Il998ll to generate data. 
This works quite well in lower dimensions. In higher dimensions, a Metropolis-Hastings- 
based simulation rule can easily be constructed. We illustrate the performance of our family 
of models using simulations in a wide range of scenarios: for light-tailed components, for 
light- and heavy-tailed components, for data simulated from Gaussian and f-distributions, for 
higher-dimensional data, and for low overall sample size. When data are simulated from the 
MPE distribution only, we also show parameter recovery. For comparison to existing mix- 
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ture models based on elliptically contoured di stributions, the mixture ( Browne et ah . 2014|) 
and teigen ( Andrews and McNicholas . 2014J) packages in R are employed. These packages 
implement mixtures of Gaussian and mixtures of multivariate Student-t distributions, respec¬ 
tively. To facilitate a direct comparison, we restrict mixture and teigen to the analogues 
of the ePEM model s (Tabled]). Note that we use the mixture package rather than mclust 
( Fraley et ahl . 20121 ) because the VVE model is availa ble within mixture but no t within 
mclust, which only implements ten of the 14 mode ls of ICelenx and GovaertI (119951 ). More¬ 
over, as compared to Rmixmod (ILebret et al.l . l2012h. certain models in the mix ture family 
are better optimized for higher dimensions (cf. Browne and McNicholasl . 2014a ). Note that 
the teigen package additionally allows for constraining of the degrees of freedom parameter 
(z/). Hence, a VVIV model implies that \g, Ag, and Ug are different between groups, and 
is the identity matrix. Note that the same starting values are used for all three algorithms, 
i.e., for each G, the initial r^g are selected from the best fc-means cluster i ng re sults from ten 
random starting values for the /c-means algorithm (IHartigan and Wond. 119791) . 


Simulation 1: Two light-tailed components A two-component mixture is simulated 
with 450 observations with the sample sizes for each group sampled from a binomial distribu¬ 
tion with success probability 0.45. The first component is simulated from a two-dimensional 
MPE distribution with zero mean, identity scale matrix, and jSi = 2. The second component 
is simulated from a two-dimensional MPE distribution with mean (2, 0)', identity scale ma¬ 
trix, and (32 = 5. Note that this corresponds to an EIIV model. The simulated components 
are not well separated. All three algorithms are run on 100 such data sets. For the ePEM 
family, a two-component model is selected by the BIG (and the IGL) for each of the 100 
data sets. On the other hand, for the mixture family, the BIG selects a two-component 
model 77 times, and three, four, and hve component models are selected 15, 6, and 2 times, 
respectively. Similarly, for the teigen family, two, three, four, and hve component models 
are selected 61, 10, 26, and 3 times, respectively. Glearly, for both of the latter families, 
more components are being htted to deal with the light-tailed nature of the data. 

For the ePEM family, the EIIV model is selected by the BIG 97 times out of 100, with the 
VIIE model selected the other 3 times. The ARI values for the selected ePEM models range 
from 0.81 to 0.95, with a median (mean) ARI value of 0.88 (0.88). The selected mixture 
models yield ARI values ranging between 0.30 and 0.96, with a median (mean) value of 0.85 
(0.79). Similarly, the teigen family yields ARI values ranging between 0.29 and 0.94, with 
a median (mean) value of 0.80 (0.69). A contour plot shows the ht of a selected EIIV model 
to an example data set (Figure |2|). The estimated mean, variance (using (|2|)), and (3 are 
given in Table [2J Clearly, the estimates are quite close to the true parameter values. 

The impact of multiple initializations in terms of the model and number of components 
selected is also evaluated. Here, the /c-means initialization mentioned above is repeated 25 
times for all 100 simulated data sets. In all cases, the same model is selected (by the BIG) 
for all 25 runs. Hence, hereafter, only one fc-means initialization (as explained in Section ITT]) 
is used for all simulated and real data. 
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Table 2: True parameter values along with mean and standard deviations of the parameter 
estimates (rounded off to 2 decimals) for the selected model from the 100 runs for Simulation 

1 . 


Parameter 

True values 

Mean estimates 

Standard deviations 

TTl 

0.45 

0.45 

0.03 

T^2 

0.55 

0.55 

0.03 

/^1 

(0,0)' 

(-0.01,0.00)' 

(0.05,0.04)' 


(2,0)' 

o 

1 

o 

o 

o 

(0.03,0.02)' 

Vari 

0.40 

0.40 

0.02 

Var2 

0.28 

0.28 

0.01 

/Si 

2 

2.10 

0.39 

/S2 

5 

5.77 

3.06 




Figure 2: Plots showing the generated data (top) and the htted density (bottom) using the 
selected model from the ePEM family for Simulation 1. This hgure appears in color in the 
electronic version of this article. 












Simulation 2: Light and heavy-tailed components A three-component mixture is 
simulated with 500 observations in total. Group sample sizes are sampled from a multinomial 
distribution with mixing proportions (0.35, 0.15, 0.5)'. The first component is simulated from 
a 3-dimensional MPE distribution with mean (0, 2, 0)' and fii = 0.85. The second component 
is simulated from a 3-dimensional MPE distribution with mean (2, 5, 0)' and (32 = 3. Lastly, 
the third component is simulated from a 3-dimensional MPE distribution with mean (4, 2,0)' 
and /da = 5. To generate the scale matrices (using an EEEV scale structure), we use 


t, = T2 = t. 


0.36 

0.48 

- 0.8 

- 0.8 

0.6 

0 

0.48 

0.64 

0.6 




Ai = A 2 = A 3 = diag(4, 3,1), where diag(-) refers to a diagonal matrix. 

For all three families, the BIG selects a three-component model for each of the 100 runs. 
For the ePEM family, the BIG selects an EEEV (WEE) model 99 (1) times. The ARI 
values for the selected models from the mixture family range between 0.87 and 0.96 with a 
median (mean) value of 0.92 (0.92). Similarly, the teigen family yields ARI values between 
0.85 and 0.96 with a median (mean) value of 0.92 (0.91). Even though all three families 
select the same number of components every time, the estimated ARI values for the selected 
ePEM models are higher, ranging between 0.91 and 0.98 with a median (mean) value of 0.94 
(0.94). A scatter plot showing an example of the generated data is given in Figure El The 
estimated mean, covariance, and (3 are given in Table El 


Simulation 3: Higher-dimensional data Here, parameter recovery is illustrated for 
the ePEM family on higher dimensional data. One-hundred sa mples of a thir t y dirn ensional 
two-component mixture model are simulated in the fashion of Murray et al.l ( 20141) . Group 
sample sizes are sampled from a binomial distribution with success probability 0.35 and an 
overall sample size of 400. The first component is simulated from a 30-dimensional MPE 
distribution with zero mean. The second component is simulated from a 30-dimensional 
MPE distribution with mean (3, 3, 3)' G) lio, where lio denotes a column vector of length 10 
with all entries equalling 1. The common scale matrix is generated using 


/ 1 0.1 0 . 2 \ 

0.1 1.5 0.3 0 / 10 , 

\0.2 0.3 1.2/ 


where 1 10 denotes a 10-dimensional identity diagonal matrix. The recovered parameter 
estimates are found to be close on average to the generating parameters. Due to the di¬ 
mensionality, we follow Murray et al.l ( 2014 ) and report the Frobenius norms of the biases 
of the parameter estimates in Table IH Clearly, the estimated parameters are quite close to 
the generating parameters. Note that while the purpose of this simulation is to investigate 
parameter estimation in higher dimensions, all 100 runs yield perfect clustering. 
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Figure 3: Scatter plots showing an example of a generated data set for Simulation 2. 

Simulation 4: Gaussian and t-components Here, we show that the ePEM family can 
recover Gaussian and t-components favourably when compared to the mixture and teigen 
families. A two-component mixture is simulated with 100 observations, where the group 
sample sizes are sampled from a binomial distribution with success probability 0.4. The hrst 
component is simulated from a 3-dimensional Gaussian distribution with zero mean. The 
second component is simulated from a 3-dimensional t-distribution with mean (5, 0, 0)' and 
5 degrees of freedom. Both components are generated using the same scale matrix: 

/ 1 0.5 0.25\ 

0.5 1 0.3 . 

\0.25 0.3 1 / 

The algorithms are run for G = 1,..., 5. The mixture family does not perform well over 100 
runs. One through Eve component models are chosen 1, 52, 31, 14, and 2 times, respectively. 
In contrast, for the ePEM family, a two (three) component model is selected 89 (11) times. 
On the occasion when a three-component model is selected, the low overall sample size 
seems to contribute to some observations from the heavy-tailed component being clustered 
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Table 3: True parameter values along with mean and standard deviations of the parameter 
estimates (rounded off to 2 decimals) for the selected model from the 100 runs for Simulation 
2 ^_ 

Parameter True values Mean estimates Standard deviations 


TTl 

T^2 

TTs 

Ml 

M 2 

M3 

Covariance 1 


Covariance 2 


Covariances 


/Si 

/S 2 

/Ss 



-0.45 

5.64 

-0.59 

-0.08 

0.97 

- 0.10 

-0.07 

0.83 

-0.09 



0.35 

0.15 

0.5 

-0.02,1.97,0.01)' 
(1.99,4.98,0.00)' 
(4.00,2.00,0.01)' 
-0.41 1.75 

5.71 
-0.57 
-0.07 
0.98 
- 0.10 
-0.06 
0.83 
-0.08 



in their own unique group. Similarly, for the teigen family, a two (three) component model 
is selected 88 (12) times. Over the 100 runs, the EEEE (EEEV) model is selected 70 (21) 
times. Given the generated data, a model with varying f3g might be expected from the 
ePEM family; however, in a few runs, the selected models have heavy tailed components 
with equal [3g. This may be due to the small overall sample size and/or the fact that the 
generated components are not clearly separated. The ARI values for the selected models for 
the mixture family over the 100 runs range from 0 (for the one-component model) to 1, with 
a median (mean) ARI of 0.94 (0.90). Similarly, the selected models from both the teigen 
and ePEM families yield ARI values ranging between 0.57 and 1, with a median (mean) 
value of 0.96 (0.94). A scatter plot showing an example of the generated data is given in 
Figure HI 


Assessing the impact of outliers We follow iMcLachlan and Peel! fl2000b[) in assessing 
the impact of outliers on the clustering performance of the ePEM family as compared to 
the Gaussian mixture models impl e ment ed in the mixture package. The crab data set, 
introduced in ICampbell and Mahon! f) 19741) . consists of Eve-dimensional observations on crabs 
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Table 4: True parameter values along with the Frobenius norms of the biases of the parameter 
estimates (rounded off to 2 decimals) for the selected model from the 100 runs for Simulation 


3 . 



True values 

||Bias|| 


0.35 

0.00 


0.65 

0.00 


(0,0,0)'®lio 

0.02 


(3,3,3)'® lio 

0.05 


O 

o 


0.1 

1.5 0.3 ®/lo 

0.26 

0.2 

0.3 1.2/ 



o 

o 


0.1 

1.5 0.3 0 Jio 

2.55 

0.2 

0.3 1.2/ 



2 

0.34 


0.95 

0.08 


Parameter 


TTl 

T^2 

Covariance 1 


Covariance 2 


/5i 

^2 


2i//^2r(2^) 




X 


X 


of the genus Leptograpsus and can be obtained from the MASS package fIVenables and Riplevl . 


20021 ). Measurements are recorded on the width of the front lip, the rear width, the length 
along the middle, the maximum width of the carapac e, and the body dep t h. The subset 
of blue crabs (50 males and 50 females) is analyzed in McLachlan and Peel f 2000b[) . where 
outliers are introduced, and a Gaussian model with a common covariance matrix as well as 
a f-mixture model with equal scale matrices and equal degrees of freedom are htted. The 
outliers are introduced by adding various values to the second variate of the 25*^ point. 
We replicate this analysis to investigate the performance of the ePEM models compared 
to the mixture models. Note that the EEEE model from the teigen family is also htted 
but does not perform well (a minimum of 37 misclas sihcations; results no t shown ). This 
is probably due to different starting values; however, McLachlan and Peel f 2000bl) do not 
provide information on the starting values used for their comparison and we are unable to 
obtain results similar to theirs. On the original data, Gaussian EEE and MPE EEEE two- 
component models yield 19 misclassihcations each. However, as the value of the constant 
that is added to the observation of interest is increased or decreased, the MPE component 
model error rate is much smaller than that of the Gaussian mixture. However, both the 
Gaussian mixture and MPE approach suher when the constant by which the value is jittered 
is extreme. 


12 
























Figure 4: Scatter plots showing an example of a generated data set for Simulation 3. 


4.2 Real Data 

We also test our algorithm’s performance on several real benchmark data sets. The body, 
diabetes, female voles, and wine data sets are commonly used for illustration in the 
model-based clustering literature. We also consider two bioinformatics data sets: the srbct 
data and the glob data. The body data contain 24 measurements on body dimension, age, 
weight, and height for 507 individ uals (247 men and 260 women), and can be obt ained from 
the gclus package f Hurley . 2012 ). The diabetes data f Reaven and Mill^ . 1979 ). obtained 
from mclust, contains three measurements on 145 subjects from three classes: chemical (36 
obser vations), normal (76 observa tions), and overt (33 observations). The female voles 
data f Airoldi and Hoffmann . 1984|l contain seven measurements on age and skull size of 86 
females of two species of voles: Microtus californicus (41 observations) a nd M. Ochro qaster 
(45 observations). The se data are available as part of the Flury package I Flnry . 2ni2h in R. 
Lastly, the wine data f Forina et al.l. 1988h contain 13 measurements on 178 wines of three 
types (barolo, grignolino, and barbera), and can be obtained from the gclus package. 

The srbct data contain gene expression microarray data from experiments on small 
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Table 5: Comparison of error rates from the Gaussian and MPE mixture models fitted to 
modified crabs data. 


Gonstant 

Gaussian 

MPE 

/3 

-15 

37 

35 

0.43 

o 

1 

40 

21 

0.46 

-5 

42 

20 

0.73 

0 

19 

19 

0.80 

5 

22 

20 

0.67 

10 

36 

37 

0.52 

15 

38 

41 

0.43 


Entries in the hrst column are the values added to the second variate of the 25*'^ observation 


to make it an outlier. Entries in the second and third columns are the number of misclassi- 
hcations for the Gaussian and MPE mixture models, respectively. Lastly, the /3 values are 
also provided. 


round blue cell tumors (iKhan et all . l2nnih. A preprocessed v ersion of these data can be 
obtained from the plsgenomics package (IBoulesteix et ahl . 12014 1. The 83 samples are known 
to correspond to four classes, including 29 cases of Ewing sarcoma, 11 cases of Burkitt 
lymphoma, 18 cases of neuroblast oma, and 25 cases of rhabdomyosarcoma. The golub data 
contain gene expression data from iGolub et ahl (1l999l) on two forms of acute leukaemia: acute 
lymphoblastic leukaemia (47 observations) an d acute myeloid leukaemia (25 observations). 
The preprocessed data used in the analysis of McNicholas and Murphy ( 20101 ) are available 
at www.paulmcnicholas.info. Note that methodology proposed herein is not designed for 
high-dimensional, low sample size (i.e., large p, small N) problems — the development of 
factor analysis-based exte nsions of MPE mixture niodels, along the lines of t he mix ture 
of factor analyzers model ( Ghahramani and Hinton , 19971 : McLachlan and Peel . 2000a) an d 
extensions thereof (e.g., iMcNicholas and Murphyl . l2008l : [Andrews and McNicholasl . l201lh . 
will be a subject of future work. Hence, both of these bioinformatics data sets are further 
pre-processed to make the clustering problem more suitable for the methodology that is the 
subject of the present work. A differential expression analysis on the gene expression data 
is performed using an ANOVA across the known groups. The top ten genes, ranked using 
the obtained p-values, are selected to represent a potential set of measurements that contain 
information allowing for identihcation of the four groups. The three mixture model-based 
clustering algorithms were then run on these processed data. The ePEM family is run on the 
scaled data for G = 1,..., 5. Table |6] compares the performance of the methodologies run 
on these data; here, the predicted classihcations from the selected model (using the BIG) 
are compared to the true class labels in each case. 

Glearly, the ePEM family performs favourably compared to the mixture and teigen 
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Table 6: Comparison of three families of mixture models on benchmark data. 


Data 

ePEM 

mixture 

teigen 

body {p = 24, G = 2) 
diabetes (p = 3, G = 3) 
female voles (p = 7, G = 2) 
wine (p = 13, G = 3) 
srbct (p = 10, G = 4) 
golub (p = 10, G = 2) 

0.94 (2; EEEV) 

0.66 (3; VVVE) 

0.91 (2; EEEV) 

0.98 (3; EEEV) 

0.82 (4; VIIE) 

0.84 (2; EEIE) 

0.80 (3; EEE) 
0.66 (3; VVV) 
0.91 (2; EEE) 
0.68 (4; VVI) 
0.82 (4; VVI) 
0.47 (5; VVE) 

0.80 (3; EEEE) 
0.67 (3; VVVE) 
0.91 (2; EEEE) 
0.68 (4; VVIE) 
0.85 (4; VVIE) 
0.74 (2; VVIE) 

Dimensionality and the number of known groups (i.e. 

, classes) are in 

parenthesis following 


the name of each data set. For each family of models, the ARI, the number of components, 
and scale structure for the selected model are given in parenthesis. 


families. For the body data, the selected ePEM model hts a mixture of two heavy tailed 
components, i.e., /3 = (0.57,0.56)', that misclassihes eight cases (4 of each gender). The 
teigen family selects a model with 3 heavy-tailed components (23.43 degrees of freedom 
each), and the selected mixture model also hts three components. For the diabetes data, 
the selected models from all three families yield similar classihcations, each with a total of 
20 misclassihcations. The selected ePEM model has j3 = 1.07 in each component, suggesting 
components that are close to Gaussian. The selected teigen model also has relatively high 
(50.30) degrees of freedom in each component, implying component shapes that are close 
to Gaussian. For the female voles data, the selected models from all three families yield 
the same classihcation results, each with two misclassihcations. For the wine data, both the 
selected mixture and teigen models have four components, with 19.15 degrees of freedom in 
each component for the chosen teigen model. However, the selected ePEM model has three 
components, with f3 = (0.62, 0.59, 0.56)', and misclassihes only one observation, whereas the 
selected mixture and teigen models misclassify 35 and 34 observations, respectively. 

For the srbct data, the selected teigen model performs slightly better than the selected 
mixture and ePEM models. All selected models ht four components with the selected 
teigen, mixture, and ePEM models misclassifying 4, 5, and 5 observations, respectively. 
The selected teigen model has 15.53 degrees of freedom in each component, while the 
selected ePEM model has f3 = 0A2 in each component. 

Despite similar outcomes being obtained for the srbct data, the results diher greatly for 
the golub data. The selected mixture and teigen models have hve and two components, 
respectively. A referee asked us to comment on situations where the number of parameters 
approaches the number of observations. A restriction can be imposed such that only those 
models are htted that estimate fewer parameters than the number of observations in the 
sample. The selected hve-component mixture model has more parameters than there are 
observations. Restricting mixture to only those models with fewer parameters than the 
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number of observations, a three-component model is selected with an ARI value of 0.76. 
The selected ePEM model also has two components with j3 = 0.28 in each component, and 
yields a higher ARI than the selected teigen model, which has 5.80 degrees of freedom in 
each component. 

Overall, on these real data sets, the ePEM family outperforms the corresponding family 
of Gaussian mixtures and performs at least as well as the corresponding mixtures of t- 
distributions. Note that the BIG and the IGL picked the same ePEM model for all real 
data sets. We also ran these thre e algor i thms on other comr nonly used data in mode l- based 
clustering: the Swiss bank note (IFlurvl. 120121) and the iris (lAndersonl . Il935l: iFisherl. Il9361) 
data sets. On these data, the selected models from all three algorithms £t the same number 
of components and had approximately the same ARI values (results not shown). 


5 Discussion 


Afemily of MPE mixture models was proposed based on the density introduced in iGomez et al. 
(119981) . This expanded family of mixture models is introduced with a greatly improved pa¬ 
rameter estimation procedure as compared to the techniques proposed previously. This 
family of mixture models is unique in being able to deal with both lighter and heavier tails 
than the Gaussian distribution. Mixtures of f-distributions can only account for heavier than 
Gaussian tails and suffers when fitted to lighter tailed data. In such cases, both mixtures of 
f-distributions and mixtures of Gaussian distributions often fit more than the true number 
of components. Using simulations, we showed that the ePEM family is a good alternative to 
mixtures of Gaussian and mixtures of Student-f distributions, and that it is able to handle 
Gaussian, heavy-tailed, and light-tailed components. Moreover, these models also allow for 
different levels of peakedness of data: from thin to Gaussian to flat. Hence, these models 
are also well suited for density estimation purposes for a wide range of non-Gaussian data. 

Estimation is provided for eight scale structures that can be obtained through the use 
of eigen-decomposition of the scale matrix. Previously, mix tures of Gaussian and uni¬ 
form distributions have been fitted to accoun t for outliers ( Banfield and Raftervl. 1993 : 
Hennig and Gorettol . 2008 : Goretto and Hennig . 2010[) . In our framework, a uniform com¬ 
ponent can be conveniently approximated by restricting /3 to be high because the power 
exponential distribution becomes a multivariate generalization of the uniform distribution. 
This enables greater parsimony than a mixture of Gaussian and uniform distributions when 
htted to data with random noise, e.g., on mean-centred data, an EIIE model requires estima¬ 
tion of only one additional parameter. A mixture of skewed power exponential distributions 
will be a focus of future work; such a model will be better suited to modelling data with 
asymr netric clusters. Lastly, no te that the ePEM family has heavy fat tails for higher dimen¬ 
sions ( Liu and Bozdog^. 2008 1: therefore, a mixture of power exponential factor analyzers 
model may be useful for higher-dimensional data with outliers. 
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Appendix 


A Fixed-point algorithm 


Zhang and Lian^ fl201Clh used fixed po int iterative estimates for Note that the MPE 
density used in IZhang and Liang] (120101 ) can be obtained by setting S = 2A, r = 2^*, and s = 
/ 3*/2 i n (ITl). where A denotes the scale matrix in the para meterization ofIZhang and Liang 


(2010). We show that the estimation procedu re used in Zhang and Liangl f 2010l ) is valid 
only for /3* G (0,4), where (3* is defined as in Zhang and Lian j fl2010l) . A proof for this 
(see Appendix IA.1|) applies to /3 G (0,2], because of the different shape parameterizations, 
without loss of generality. In Figure [5], we present four comparisons of the trajectory of 
log-likelihood values between our proposed estimation and using fixed point iterations: for /3 
equaling 1.5, 1.9, 1.95, and 2.05, respectively. For all cases, 1000 observations were generated 
from a 2-dimensional zero-centred power exponential distribution. Only S is estimated, 
with the other parameters held constant. For both algorithms, S is initialized as an identity 
matrix. Clearly, as jS approaches 2, the log-likelihood values for the fixed point estimating 
procedure (red line) oscillate more heavily. This leads to non-monotonicity of the likelihood, 
complicating the determination of convergence. Moreover, notice that certain values of the 
log-likelihood for the fixed point are not plotted for f3 = 2.05—this is because of numerical 
errors. Note that each of the subplots in Figure [5] has two ordinate axes due to different 
scales of the values from each procedure. We also provide similar plots for f3 equaling 1.99 
and 2.05 for a 10-dimensional simulation (Figure [6]). The results are quite similar. We have 
conducted extensive simulations and, in every case, the log-likelihood values from the fixed 
point iterations diverge for (3 > 2. In most cases, the fixed point iterations do not even run. 
Because this is equivalent to (3* > 4, we conclude that the GEM approach is better than 
using hxed point iterations. Furthermore, note that for (3* < 2, the fixed point algorithm for 
an unconstrained scale matrix converges due to conca vity properties (similar to our VVV 
case for 0 < /3 < 1). Note that Zhang and Liang ( 20101) only deal with < 4 in their work. 


A.l Fixed point stability 

The fixed Doint algorithm from Izhane and LianeJ (l2010h diverges for /3>2 


17 











































OD 


p 

p 

CD 

EL 'n 

OT cr’ 

° s 


I—• ^ 

rt- CD 





CD 

w 


o O 

CD 

^ oq 


^ 1- 

P 

£- 

? 1 

P‘ 

(K? 

^ H-* 

B ° 

1—^ 

P o 

jui 

p CP 
w 

o’U 

1—^ 

jo 

P o 

03 (ri“ 
h— Cfi 

1—1 

CP d 

CD 

p 2 

o\ 

d 'q 


? o 

P 

p 

P 

H ^ 
^ Q 

o 

^ H 



CD 


W 

CD 

O 

o 

o 

o S" 

< 

d d 

1 ^ 

2- 

h-S D 




CT 

O 


P 

Pb 

X 

CD 

o 

p‘ 


p 

p 


CP 

p 


CP v> 
CD 

cr d- 
o 

^ CD 

c+ (/J 


^ P- 
Pk oq 


Log-likelihood using proposed GEM 
-1300 -1250 -1200 



Log-likelihood using fixed point iterations 


Log-likelihood using proposed GEM 
-1400 -1300 -1200 




-T"'-1-1-1-1-T 

-60000 -40000 -20000 


Log-likelihood using fixed point iterations 


Log-likelihood using proposed GEM 


-1560 -1520 -1480 -1440 





2! O 

g - IS 



-1438 -1434 -1430 -1426 


Log-likelihood using fixed point iterations 


Log-likelihood using proposed GEM 


-1350 -1300 -1250 -1200 



Log-likelihood using fixed point iterations 


































Figure 6: Log-likelihood plots for the proposed GEM procedure and hxed point-based esti¬ 
mating algorithms on 10-dimensional data. The left- and right-hand panels have (3 values of 
1.99 and 2.05, respectively. 


Proof. If X follows a p-dimensional power exponential distribution, the log-likelihood with 
respect to S is 

NG 

£(E) = 5^ 5^ - log |S|-‘ - - [(x, - - m)]^ 

i=l g=l 

Then, upon taking the derivative of with respect to we can obtain the hxed point 
update 

o 

/(S) = ^ [{Xi - {Xi - ^l){xi - /x)'. (5) 

i=\ 

Now, 


o J'' 

vec(/(S)) = [i^i - vec{{xi - /x) «) (xi - /x)). 

i=l 

Taking the derivative with respect to S, we get the Jacobian 

i=l 

X vec((a3j — /x) (g) {xi — fi)) vec(S“^(a3j — fi){xi — /x)'S“^)' 

i=l 

X vec((a3i — /x) 0 (xi — /x)) [vec(S“^ 0 S~^) vec((a:;j — /x) 0 {xi — 
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Then, 


tr(J) = ^ 

i=l 

X tr {(ajj - fj,){xi - /x)'S"^ (g) {xi - y){xi - 

2=1 

X tr {(ajj - ^){xi - /x)'S"^} tr {(a;* - ^){xi - 
= tr |(1 - ^ [(iCi - /x)'S^^(a;i - (a^^ - ^)(a;i - /x)'| . 


Evaluating tr(J) at S = S, we get tr(/p(l —/S)) = (1 — (3)p. Now, because the matrix norm 
||-B||p = tr(S^)^/^, II J||i = tr(J) = p(l — /9). Also, note that because || J||i < p||^||cx), and 
we require ||J||oo < 1 for stability (||J||oo = 1 for neutrality), we have 

ll^lll <p||J||oo <P- 

Hence, p{l — (3) < p, leading to 0 < /3 < 2. Therefore, the solution diverges for (3 > 2. □ 


B Inference 


The likelihood of the MPE mixture model is 

N G 


r 1 'i 

^0(015) = n ^ 9 K\'^g\~^ exp -- {{xi - ^i)[gi:-^{xi - ii)igY^ \ , 

i=l g=l ^ ' 

where kg is analogous to fc in (|2]), and {xi — pk)ig = Xi — pig. Note that S is considered incom¬ 
plete in the context of the EM algorithm. The complete-data are Sc = {(®i, ^i), • • •, (a^vr, ^w)}, 
where the missing data z* = (za, • • • ^Zic)' is the component label vector such that Zig = 

1 if Xi comes from the population and 0 otherwise. The complete-data log-likelihood 
Cc{@) = logLc(®|‘5c) can be written as 

N G 

^c(©) = X^X^^isfog 

i=l g=l 

where 5ig{xi) := Si (xi\pig, S^) = (a;, — pig)' (^Xg — pig). The E-step involves calculating 
the expected complete-data log-likelihood, which we denote Q. We need the expected values 


TTgkgl'Sgl 2 exp 


Sig{Xif<^ 


T~ig ■ 


^gf { Xilflg., ^gi jSg 


'Ylij = l^jf ( I i^j ) ^ J ) A 


( 6 ) 
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for i = 1,..., and g = 1,... ,G. The M-step on the {k + l)th iteration involves maxi¬ 
mization of the expected value of the complete-data log-likelihood with respect to ©. The 
update for Tig is 

kg = Ug/N, 

where Ug = ng. 

However, the updates for l3g, fig, and Eg are not available in closed form. A Newton- 
Raphson update is used to hnd the update for fig, and we need the following: 

(7) 

( 8 ) 


dQ 


N 


d^Q 


i=l 

N 


7 =4 ^ + (4 - l)5ig{Xif^ ^^g\xi - 


i=l 


where 6ig{xi) := (ccj — figY^Eg {xi — fig') and {xi — fi)ig = Xi — fig. An update for 4 can 
be obtained by solving the equation 


pUg 


new 

9 




P 


m 


+ 


new 

9 


PUg log 2 

ih' 


N 


~^Tig\^^g^ig{Xi)] {6ig{Xi)y 


jnew 


= 0 


(9) 


2=1 


for where '0(') is fhe digamma function. Alternatively, a Newton-Raphson method 
might be implemented using the following: 


dQ png 


dl3g 

d^Q 

df^l 


2/3,2 




9 

png 


p \ png log 2 

Wg 


N 


+ 


PI 


if 


1 + ^ 

2/^9 


24 

P^rig 

44 




X 


i=l 


Pi 


1 + J7 


^Og 5ig{x 
png log 2 

K 


( 10 ) 


N 


( 11 ) 


2 = 1 


where pi{-) is the trigamma function. Uptates for f3g when it is constrained to be equal 
between groups can be obtained similarly. Because the update for Eg is not available in 
closed form, we rely on convexity properties. For the updates for the EEI, VVI, EEE, EEV, 
VVE, and VVV scale matrices, we utilize a minorization-maximization step. Because of the 
properties of a minorization-maximization algorithm, this step increases the expected value 
of the complete-data log-likelihood at every iteration, thus making the estimation algorithm 
a generalized EM (GEM) algorithm. In addition, for the EEE, EEV, VVE, and VVV scale 
matrices, we utilize an accelerated line search method on the orthogonal Stiefel manifold (cf. 
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Absil et al.l. 120091: iBrowne and McNicholaa . l2014b|) . An MM algorithm can be constructed 
by using the convexity of the objective function—a surrogate minorizing function is employed 
that is maximized. Note that the surrogate function c onstructed in the E-step in an EM 
algorithm is, up to a constant, a minorizing function I Hnnter and Lange . 2004j) . For the 
Eli, VII, EEI, VVI, FEE, EEV, VVE, and VVV scale structures (as listed in Table [T]), the 
updates are discussed below. The pseudo-code for the estimation of parameters is: 

1. Initialize jSg, fig, Eg. Compute ([6]). 


2. Update /3g using either ([9]) or fflU]) and flTT]) : or flT^ or flT5]) and flTT|) . depending on 
whether f3g is unconstrained between groups or not. 


3. CM step 1: Update fi using ([7]) and 


4. CM step 2: Update Eg depending on the scale structure. 

5. Check for convergence. If not converged, go back to Step |2l 


B.l Shape parameter constrained between groups 

When /3g is constrained to be equal between groups, the update for (5 can be obtained by 
solving the equation 




+ 


pN log 2 



G N 

5^ Tspog 5ig{X,)]5ig{Xif 
3=1 i=l 


0 


( 12 ) 


for Alternatively, a Newton-Raphson method might be implemented using the follow¬ 
ing: 


m 

dp 

d^Q 

dp^ 


G N 


-pN 


P^ 


p{l + 


JL 

2P 


2/32 

4/34 V 2/3 


(13) 


g=l i=l 

P 


pN log 2 

P^ 


G N 


“ ^ kgixPf 5ig{Xif. 


3=1 i=l 


(14) 


B.2 Scale structure VVV 

Here, details are provided on estimation of the unconstrained scale matrix (VVV structure). 
On ignoring terms not involving Eg, we have 

N G 

2(^9) = ^ ISgi"^ - ^ {{xi - fiy.gi:p\xi - ^i)igY^. 

1=1 3=1 
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The updates differ based on the value of Denote = {xi — ^)ig{xi — where 

{x, - ^l)ig = Xi- 

fig’"'" G (0,1): Here, we borrow from the minorization-maximization framework for es¬ 
timation. Note that tr is concave for jig G (0,1), where tr(-) refers to the 

r — 1 'I 

trace. A surrogate function for trjSg ® can be constructed using the supporting 

hyperplane inequality: 


r new 


3 new_i 


1 -i flnew r ^ —I Pg f ■' —1 

trfsyM”-}"' <tr|S, M™! +,3™tr|S, M 


Then, the following is maximized: 

N G 

5:^-^iogiE,i-‘+ 


*9 


i=l g=l 


X 


, -1 1 

tr<!S^ M-' 


new 

ig 


/§new_i 


+/3rtr<|s/M-r' X 


tr{S;'M"7}-tr|s/M-} 


leading to the update 


/Onew ^ /3new_]^ 


(15) 


1 

^ 2 = 1 


/^new g [l,oo): Using the Jordan decomposition, = DgA^^D'^, where Dg is an or- 
thonormal matrix and Ag is a diagonal matrix of eigenvalues. Now, let S” = DgAg ® D'^ 

_l/onew 

where Ag ® = Ag. We obtain updates for both A’f"'" and Dg’"'". 

It follows that 


tr \ (Xi - fj,)'- bg\l^^^ D (Xi - fj,)ig 


anew 

^9 


= tr < n'- A 




ig g 


V 


*9 


3 new 
9 


,iV, 


where Vig = Dg{xi — ^)ig. Then, for i = 1, 


1/ffi 


/onew 

^9 


new 

9 

gh ^igh 


Kh=l 


where Ag = diag(Agi,..., Agp). This function is concave with respect to the eigenvalues 
Ag = {Agi,..., Agp} (cf. Weighted p-norm). A surrogate function is constructed using 

/(Ag)</(Ag) + (V/(Ag))'(Ag-Ag), 
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This can be simplified to 
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X I AgAg Vig - tr i VigXg Xg Vig 


where Vig = Vigv[ Now, let Wig = Wigw[ where Wig = v[ X 
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^9^ig\^ig^^9 


WinWL(wLWig)^'- 
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3 = (WinW.^ra 


/(Ag)<tr A, ^ F 


ig^ig 
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^new 

= W^^g . Now, 


ig g 


Also, note that here. 


ig 
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+ (tr ^ AgA^ A^ 




Then, the following is maximized: 
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On taking the derivative with respect to Ag, it can be shown that the update for Ag is 
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13. 
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new 
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1//3; 
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(16) 
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i=l 


where 


W = A A 

ig ^g * ig g ’ 


^ig ^ig^ig Hnd Vig Dg(^Xi fjXjig. 

Regarding the update for IDg (this is the same as Fg in Table [1]), an orthonormal matrix, 
we use an a ccelerated line search for optini ization on the orthogonal Stiefel manifold as 
employed by Browne and McNicholasI f 2ni4bl ). For minimizing a function of an orthonormal 
matrix, the search space is the orthogonal Stiefel manifold equal to the set of all orthonormal 
matrices M. = {X G : X'X = Ip}. The idea behind the line search method is to 

move along a specific search direction in the tangent space until the objective function is 


24 

















-1 AT 1 //Qnew 

reasonably decreased (jBrowne and McNicholasl . l2014bl) . Let Qg = Lg * ■ The 


objective function that needs to be minimized is 

G 


-1 


new 

9 


9=1 


with an unconstrained gradient 

-- / /. \-i 

grad/(i?,) = 2/3- QgDg ( A7) D' QgDg ( A' 


— Rg. 


As shown bv iBrowne and McNicholasl (j2014bl) . the direction of the steepest descent while in 
TxM. (the tangent space of X) at the position X is grad/(X) = Px (grad/(X)) , where 


:zj = z-x 


(X'Z + Z'X) 


is the orthogonal projection Px of a matrix Z onto TxM.- Hence, we get 


glMf{Dg) = Rg - -DgR!gDg - -DgD'gRg. 


2 ® ^ 


In order to obtain convergence, the step size t* is taken to be the Armijo step size (which 
guarantees convergence) and Dg is updated as 




-tl X giadfibg 


(17) 


where Rx is a retraction R at X. A retraction — a smooth mapping from the tangent 


space to the manifold — allows for searching along a curve in the manifold 


ing in the direction of the tangent vector). As in iBrowne and McNicholasl (2014blj. the 


while mov- 


QR decomposition-based retraction is used herein. See iBrowne and McNicholas 
details on the retraction and the Armijo step size. 


(1201414) for 


B.3 Scale structure VVI 

There are two solutions depending on the current estimate of /3g. Denote Af— = (ccj — 
n)ig{xi - n)[g, where (ccj - n)ig = Xi - /x—. 

^new g Ugjj^g Jorduu decompositiou, we can write = A~^, because Dg is 

an identity matrix. Recall that the VVI scale structure refers to a diagonal constraint such 
that = Ag^ = \g = diag(Agi,..., Xgp), where diag(-) denotes a diagonal matrix. Note 

that tr [Agixi - fj,)ig{xi - fj,)[g) ® can be written as {Y,h=i - ^J^gh) Xgh) ® . This is 
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a concave function with respect to the eigenvalues of the diagonal matrix. Then, a surrogate 
function can be constructed: 
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Then, we maximize: 
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On taking the derivative with respect to Ag, we obtain the update 
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( 18 ) 


^new g [l,cx)): Using the Jordan decomposition, we can write \ because Dg 

_ _ -^/pnew 

is an identity matrix. Let = Ag = A^ ® . Proceeding in a similar fashion to the Ag 
update in the VVV G (1, cx))) case, we can get the update 
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W ^ 

ig 



where JU,, = 


y ^ 
^g ■ 


(19) 


B.4 Scale structure VVE 

There are two solutions depending on the current estimate of f3g. We use similar ideas as 
in the VVV and VVI cases. Denote = {xi — fj,)ig{xi — fi)[g, where {xi — fi)ig = a;, — 
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/^new g (0,1): Using the Jordan decomposition, we write ^ = DA^^D'. Now, let 
= DAgD', where A“^ = Ag. Proceeding as before, a surrogate function can be 
constructed such that 


^ , /Qne 

tT{\gVigf^ 


<tr <i AgVig 


^new 


+ ^rtT\v'^AgV,g 


^new_i 


X 


tr {AgVjg} - tr AgV. 


ig 


where Vig = D{xi — fi)ig and Vig = VigVig. Then, we maximize: 

N G 


log I Ag 


Ti 


*9 


i=l g=l 


tl{ AgVig 


inew 
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+ «“triA,F„ 


^new_i 


X trjAgViJ -trMgV, 


*9 


On taking the derivative with respect to Ag, we obtain the update 
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new 
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/^new g [l,cxo): Using the Jordan decomposition, we write = DAg^D'. Now, let 


^-1 


1 / anew / one 

= DAi^^ D', where Ag 


= Ag. Proceeding in a similar fashion to the A 


new 

9 


update in the VVV (/Jg®* G (1, cxo)) case, we can get the update 


'' new 

Ag ~ 
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( 21 ) 


/ , 


_1 ^ _1 

where Wig = A/VigA/, Vig = D {xi - p)ig and Vig = VigV[g. 


The update for i.e., Dg constrained to be equal across groups (same as P in Tabled]), 
is similar to the update for in the VVV model. We again use an accelerated line search 
for optimization on the orthogonal Stiefel manifold as employed by iBrowne and McNicholas 


(2014b). Let Q = ■ The objective function that needs to be minimized is 


g ■ ig 

nD)^Y.U^AQ,D (a 


new 
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-1 


D' 


/bnew 
^9 


, with an unconstrained gradient 


grad/(D) = ^ 24 ™(Q^D(A;") D'l ’ (A) 

9=1 


-1 


— Rg. 
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As shown in lBrowne and McNicholasI fl2014bl) . the direction of the steepest descent while in 
TxM. (the tangent space of X) at the position X is grad/(X) = Px (grad/(X)) , where 
Px {Z) = Z — is the orthogonal projection Px of a matrix Z onto TxM. 

Hence, we get 

G ^ G ^ G 

grad/(D) = 5^ R» - 5 5^ D%D -DD'R,. 

3=1 3=1 3=1 

To obtain convergence, the step size t* is taken to be the Armijo step size (which guarantees 
convergence) and D is updated as 


D =Rx 


-t*k X grad/(£>) 


( 22 ) 


where Ry is a retraction R at X. As befor e, we use the QR decomposition-based retraction, 
similar to iBrowne and McNicholasI fl2014bh . 


B.5 Scale structure VII 

Recall that the VII scale structure refers to an isotropic constraint such that = Xglp. 
Then, on ignoring terms not involving we have 
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Setting the derivative with respect to ^ to 0 yields 
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B.6 Scale structure Eli 

Recall that the EH scale structure refers to an isotropic constraint such that = S = Alp. 
Then, 
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Setting the derivative with respect to A ^ to 0 yields 
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Hence, A"®" can be found by solving the equation 
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( 24 ) 


B.7 Scale structure EEE 


Here, we provide details on estimation of the scale matrix when it is constrained between 
groups. Denote M"®" = {xi - n)ig{xi - where {xi - fi)ig = Xi - On ignoring 

terms not involving S, 
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i=l g=l 
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The updates differ based on the current value of Pg. 

\/g G (1...G) /3“®"' G (0,1): Using the Jordan decomposition, we can write = 
DA-^D' = DAD', where D is an orthonormal matrix, A is a diagonal matrix of eigen¬ 
values, and A = A~^. We obtain updates for both A and D . Using similar ideas as 
before, we can construct a surrogate function: 


tT{\VigY<’ < trMV 
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where Vig 
obtained 



Vigv'^g. Then, using the above, an estimate can easily be 
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(25) 


3g G (1,...,G) such that /5;®* G [l,cx)): Let = DA ^D' = DA^/^*D', where 
A“^/^ = A, (3* = inax(/Si, • • •, Pg) and (3* > 1. Note that 

where A = diag(Ai,..., Ap). This function is concave with respect to the eigenvalues A 
(similar to a composition of a weighted p-norm and a variable raised to a power less than or 
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equal to 1). Then, the following update can be obtained by proceeding in a similar fashion 
to the VVV case: 
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(/3*) 


i/(Tr 


_ /onew /V i 

" wf; A 
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where Vig = D {xi - fj,)ig, Vig = Vigv'-g, and Wig = A ^VigA \ 

The update for D is similar to the update for the VVE model. Let 
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g 71 /f-new 


^9 
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and the objective function that needs to be minimized now is 

fiD) = J2^rlQ^n (A”y £>'''* 
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B.8 Scale structure EEI 

The estimate for the diagonal matrix of eigenvalues S in the EEI case can be derived using 
ideas similar to the EEE and VVI case. We obtain 
G (1...G) e (0,1): 


s = 
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G N 
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new 
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3(7 G (1,..., G) such that G [1, cxo): 


G 
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{p*r 
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where Wig = i: 


(28) 


B.9 Scale structure EEV 

The estimate for Hg in the EEV case can be derived using ideas similar to the EEE and 
VVV case. 
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(l.-.G) e (0,1): 
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where Vig = Dg{xi - /x)*^, Vig = Vigv'^g. 

3(7 G (1,..., G) such that G [1, oo)\ 
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5=1 


2=1 


where = Dg{xi - ^i)ig, V ig = Vigv[g, and Wig = A Wig A 


j.r 1 //onew 

The update for D is similar to the EEE and VVV models. Let Qg = X]i=i '''ig * ■ 

The objective function that needs to be minimized now is 

G 


/□new 

^9 


f{D,) = J2^v\Q,D,(A’") £>; 

9=1 


C Initialization, model selection, and performance as¬ 
sessment 

C.l Model selection and initialization 

In model-based clustering applications, it is common to ht each member of a family of 
mixture models for a range of values of G, out of which a ‘best’ model is chosen based on 
some likelihood-based criterion. Note that this best model does not nece s sarily correspond to 
optimal clustering. The Bayesian information criterion (BIG; Schwarz . 1978h is commonly 
used for mixture model selection. Even though the regu larity properties ne eded for the 
development of the BI G are not satished by m i xture models flKeribinl.ll998l.l2000li . it has been 
used extensively (e.g.. iDasgupta and Raftervl. Il998l: iFralev and Raftervl. 120021) and performs 
well in practice. The BIG can be computed as 


BIG = 2/(0) -m log iV, 

where /(0) is the maximized log-likelihood, m is the numb er of free parameters, and N is 
the sample size. The integrated completed likelihood (IGL; iBiernacki et al.l. 120001) aims to 
correct the BIG by putting some focus on the clustering performance. This is done via the 
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estimated mean entropy, which reflects the uncertainty in the classiflcation of observations 
into components. The ICL can be computed via 


N G 


ICL BIG + EE MAP(r,,)logr,„ 

i=l g=l 

where MAP(rig) is the maximum a posteriori probability, equaling 1 if max/i=i^,,,_G(ri/i) occurs 
at component h = g, and 0 otherwise. 

Because the EM algorithm is iterative, initial values are needed for the parameters. The 
issue of starting values is important because the performance of the EM algorithm is known to 
depend on the sta rting values. Poor s tartin g values can result in singularities or convergence 
to local maxima flTitterington et al.L 119851). Some techniques that can allevi ate such issues 
are constr aining eigenvalues flingrassia and Roccil. 120071: iBrowne et al.L l2013l) , deterministic 
annealing fjZhou and Langd. l2010l) . or picking a run from multiple starts for the EM. The 
algorithm can be initia lized based on a random a ssignment of data points to components, 
on /c-means clustering ( Hartigan and Wongl. 1979), on some hierarchical clustering method, 
or in some other way. We constrain (3g to be less than 200 for numerical stability—this is 
similar to how the degrees of fre edom paramete r in rn ixtures of f-distributions is sometimes 
constrained to be less than 200 (lAndrews et al.L 1201111 . 


C.2 Convergence criterion 


Here, a stopping criterion based on Aitken’s acceleration flAitkenl . 1192611 is used to determine 
convergence. The commonly used lack of progress criterion can converge earlier than the 
Aitken’s stopping criterion, resulting in estimates that might not be close to the maximum 
likelihood estimates. The Aitken acceleration at iteration k is 


[k) - 




lik) 


o'- ' = 


lik) _ /(fc-1)’ 


where is the log-likelihood value from iteration k. An asymptotic estimate of the log- 
likelihood at iteration k + 1 can be computed via 

1 


= /(^) + 


1 - a(^) 


(/new _ 


(IBohning et al.L ll994J ) . Convergence is assu med to have been reached when l y'" — < 


provided that this difference is positive (cf. iLindsavl . Il995l: iMcNicholas et al.L 120101) . Note 
that we use e = 0.005 herein. 


C.3 Performance assessment 


The adjusted Rand index (ARI; iHnbert and Arabiel . Il985h is used for determining the per¬ 
formance of the chosen model by comparing predicted classifications to true group labels. 


32 






















































when known. The ARI corrects the Rand index flRandl. Il97ll) to account for chance when 
calculating the agreement between true labels and estimated classihcations. An ARI of 1 
corresponds t o perfect agree ment, and the expected value of the ARI is 0 under random 
classification. Steinlev f 2004j) provides a thorough evaluation of the ARI. 


Table 7: Time taken in seconds to run all sixteen models (based on un-optimized code) for 
the real data examples for G = 1,..., 5. 


Data 

Time taken (seconds) 

body (p = 24, G = 2, AT = 507) 

19151 

diabetes {p = 3, G = 3, N = 145) 

310 

female voles (p = 7, G = 2, N = 86) 

291 

wine (p = 13, G = 3, A^ = 178) 

2326 

srbct (p = 10, G = 4, A^ = 83) 

1101 

golub (p = 10, G = 2, A^ = 72) 

405 


Dimensionality, the number of known groups (i.e., classes), and the number of sample points 
are in parenthesis following the name of each data set. 
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